M.Sc Data Science


J.Francisco Munoz-Elguezabal - franciscome@iteso.mx

Predictive Models for Financial Time Series Data: A Convex Optimization Perspective.

Github Repository: Link

1. Abstract


Financial time series prediction is a particularly hard task, which makes it a natural fit for Machine Learning tools, particularly for the feature engineering process, predictive modeling and hyperparameter optimization. In this work we show a process to build three types of classification models: OLS Regression with Elastic Net regularization, L1 Support Vector Machines and an Artificial Neural Network (Multilayer Perceptron); in addition, we propose a genetic programming approach to generate, from linear features, a non-linear feature engineering process. We also found that taking the sign of the OLS Regression output produces very poor classification results, whereas the l1-SVM and ANN-MLP performed very well. Finally, we report above 90% out-of-sample classification accuracy.

2. Introduction


Time series prediction using machine learning techniques is a complex problem, one that can be framed around three challenges:

  1. Feature engineering and feature selection
  2. Model definition and hyperparameter optimization
  3. Cross-validation

In the feature engineering and selection processes, particular challenges arise when working with time series data and constructing endogenous explanatory variables (features); the main one is multicollinearity among candidate features. In the model definition process, the bias-variance trade-off is an ever-present dilemma, along with the desire for low model complexity and high explainability. Finally, cross-validation techniques for time series data differ substantially from those for panel data, mainly because of the importance of the order of the data and its "memory" properties, such as the commonly observed autoregressive behavior.

In the context of machine learning models, whether for regression or classification of a target variable, one aspect is of particular interest: the definition of the model as a convex or non-convex formulation of the regression/classification problem. If the model is based on a convex formulation, a unique solution is guaranteed to be found by the optimization process, whereas a non-convex formulation offers no such guarantee.
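To illustrate why convexity matters, the sketch below uses the plain OLS problem, which is convex: every solver reaches the same global minimizer. The data, coefficients and seed are made up for the example:

```python
import numpy as np

# OLS is a convex problem: the minimizer of ||y - Xb||^2 is unique
# (when X has full column rank) and any solver finds the same optimum.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
true_b = np.array([1.0, -2.0, 0.5])
y = X @ true_b + 0.01 * rng.normal(size=50)

# two different routes to the same global optimum
b_closed = np.linalg.solve(X.T @ X, X.T @ y)      # normal equations
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solver
```

A non-convex objective (e.g. a neural network loss) offers no such guarantee: different starting points can end in different local minima.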

3. Problem Description


In Machine Learning applications for time series prediction, a modelling process that is not robust can have serious consequences, from poor performance to overfitting. Possible causes include a feature engineering process that fails to provide enough information for the model to predict, a poor model definition (or worse, a very complex and hard-to-explain model), combined with a cross-validation process that ignores the "memory" in the data.

To motivate this work, with a particular focus on convex optimization theory, we take the three problems identified in the previous section (feature engineering and feature selection; model definition and hyperparameter optimization; cross-validation) and propose the following research question, three hypotheses, and a general experiment to test them.

Research Question

From a convex optimization perspective, which processes should be performed in order to fit predictive linear models for financial time series data, using linear and non-linear endogenous features?

Hypotheses

  1. There is a feature engineering process, using only endogenous information, that generates both linear and non-linear explanatory variables and contributes to an accuracy above 70%.
  2. The linear kernel is the option with the lowest out-of-sample error for the L1-SVM.
  3. Both linear models reach above 70% performance with linear and non-linear variables, compared against any other type of kernel for the SVM.

Experiment

  • Use financial time series data (the daily Usd/Mxn exchange rate)
  • Use a feature engineering process with three types of variables
  • Define the prediction problem as a classification problem
  • Implement an OLS multivariate regression with Elastic Net regularization model
  • Implement an L1-SVM for the non-separable case, with Linear, Polynomial and RBF kernels.

4. Justification


This work is relevant from a practical perspective because it tests whether a non-linear transformation of linear variables is sufficient for a linear model to perform well, both from the perspective of an OLS + Elastic Net regularization problem for sign prediction, and of an SVM for classification.

5. Objectives


Main Objective

Build a predictive model for financial timeseries data, using only endogenous variables, and using a classification approach.

Specific Objectives

  1. Build and visualize results of Feature Engineering process
  2. Build and visualize results of fitting an OLS with Elastic Net Regression model for sign classification.
  3. Build and visualize results of L1 Support Vector Machines for sign classification.

6. Materials and Methods (Code execution)


6.0 Install dependencies

In [1]:
# %%capture is to hide the results of the execution 
#%%capture
# !pip install -r requirements.txt

6.1 Code initialization

Import libraries and other project scripts

In [2]:
# -- Import libraries for this notebook
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

# -- to use in google colab only
# from IPython.display import Math, HTML

# -- Import the projects scripts
import functions as fn           # feature engineering and processes
import data as dt                # input and out data processes
import visualizations as vs      # all the plots

# -- to visualize offline plots inside jupyter notebook
from plotly.offline import iplot

6.2 Data Folds & Exploratory Data Analysis

The data used in this project is the future contract price for the Usd/Mxn exchange rate, obtained from the Chicago Mercantile Exchange Group, covering almost 4 years, from 2017-01-02 00:00:00 to 2020-10-30 00:00:00. We have the OHLC (Open, High, Low, Close) prices and the Volume.

In [3]:
# general data
data_ohlc = dt.ohlc_data.copy()

# print the first 5 rows
data_ohlc.head(5)
Out[3]:
timestamp open high low close volume
0 2017-01-02 26.68090 26.69514 26.63116 26.63116 808
1 2017-01-03 26.62407 27.32240 26.57454 27.24053 59760
2 2017-01-04 27.24796 28.12148 27.13704 27.87845 57761
3 2017-01-05 27.87845 28.08989 27.35230 27.90957 100513
4 2017-01-06 27.89400 27.80095 27.45744 27.48008 43104
In [4]:
# print the last 5 rows
data_ohlc.tail(5)
Out[4]:
timestamp open high low close volume
1188 2020-10-26 21.07038 21.24947 21.02607 21.04377 51816
1189 2020-10-27 21.03934 21.20441 20.94241 21.16402 42135
1190 2020-10-28 21.16850 21.47766 21.12379 21.30379 85099
1191 2020-10-29 21.30379 21.58429 21.23593 21.43163 63725
1192 2020-10-30 21.43163 21.60294 21.27660 21.28565 52380
In [5]:
# all data description
table_1 = data_ohlc.describe()

# 
table_1
Out[5]:
open high low close volume
count 1193.000000 1193.000000 1193.000000 1193.000000 1193.000000
mean 21.990627 22.125658 21.864194 21.986336 40439.915339
std 1.625144 1.674188 1.571323 1.619945 24604.121939
min 19.204920 19.230770 19.197540 19.201230 445.000000
25% 20.955570 21.052630 20.872470 20.964360 28615.000000
50% 21.630980 21.772260 21.514630 21.630980 41021.000000
75% 22.763490 22.920010 22.634680 22.758310 53614.000000
max 28.579590 28.776980 28.441410 28.579590 176260.000000

The target variable, $y_{t}$, is a binary discrete representation of the one-day price change, with the form $y_{t} \in \left\{ -1, 1\right\}$, calculated as $sign[close - open]$ and then shifted one step so that the features at $t$ predict the sign at $t+1$.

In [6]:
# produce an example dataframe
experiment_data = data_ohlc.copy()

# target variable generation
experiment_data['co'] = experiment_data['close'] - experiment_data['open']
experiment_data['co_d'] = [1 if i > 0 else -1 for i in list(experiment_data['co'])]

# shift target variable to avoid direct leakage of target variable info in features info
experiment_data['co_d'] = experiment_data['co_d'].shift(-1, fill_value=9999)

# print the first 5 rows of dataframe
experiment_data.head(5)
Out[6]:
timestamp open high low close volume co co_d
0 2017-01-02 26.68090 26.69514 26.63116 26.63116 808 -0.04974 1
1 2017-01-03 26.62407 27.32240 26.57454 27.24053 59760 0.61646 1
2 2017-01-04 27.24796 28.12148 27.13704 27.87845 57761 0.63049 1
3 2017-01-05 27.87845 28.08989 27.35230 27.90957 100513 0.03112 -1
4 2017-01-06 27.89400 27.80095 27.45744 27.48008 43104 -0.41392 1
In [7]:
# dates for every fold in order to construct the 2nd plot
dates_folds = [data_ohlc.iloc[947, 0]]

# print messages in console
print('\nThe complete time period in the data is: ', len(data_ohlc), 'market* days')
print('\nThe training period is \nfrom: ', data_ohlc.iloc[0, 0], 'to:', data_ohlc.iloc[947, 0], ',',
      948, 'market* days in total')
print('\nThe testing period is \nfrom: ', data_ohlc.iloc[948, 0], 'to:', data_ohlc.iloc[-1, 0], ',',
      238, 'market* days in total')
The complete time period in the data is:  1193 market* days

The training period is 
from:  2017-01-02 00:00:00 to: 2020-01-17 00:00:00 , 948 market* days in total

The testing period is 
from:  2020-01-19 00:00:00 to: 2020-10-30 00:00:00 , 238 market* days in total
In [8]:
# OHLC Prices with train & test vertical line division
plot_2 = vs.g_ohlc(p_ohlc=data_ohlc,
                   p_theme=dt.theme_plot_2,
                   p_vlines=dates_folds)

# show plot in explorer
iplot(plot_2)

Besides generating all the features, we applied a standardization process to every variable, that is, $z_{k} = \frac{x_{k} - \mu}{\sigma} $
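The standardization step can be sketched as follows (an illustrative helper, not the project's own implementation; the column names are toy examples):

```python
import pandas as pd

def standardize(features: pd.DataFrame) -> pd.DataFrame:
    """Column-wise z-score: z_k = (x_k - mean) / std."""
    return (features - features.mean()) / features.std()

# toy feature columns, for illustration only
df = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0],
                   'x2': [10.0, 20.0, 30.0, 40.0]})
z = standardize(df)   # every column now has mean 0 and std 1
```

Note that in a train/test setting, $\mu$ and $\sigma$ should be estimated on the training fold only, to avoid leaking test information into the features.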

6.3 Project Main Objective

The main objective of this project is to have a classifier that correctly predicts whether the close price at $t+1$ will be higher or lower than the close price observed at $t$.


6.4 Auto regressive feature engineering

For reference, this is the core of the SymbolicTransformer fit used for the symbolic feature engineering (section 6.6):

model_fit = model.fit_transform(p_x, p_y)

# output data of the model
data = pd.DataFrame(np.round(model_fit, 6))

# parameters of the model
model_params = model.get_params()

# best programs dataframe
best_programs = {}
for p in model._best_programs:
    factor_name = 'sym_' + str(model._best_programs.index(p))
    best_programs[factor_name] = {'raw_fitness': p.raw_fitness_, 'reg_fitness': p.fitness_, 
                                  'expression': str(p), 'depth': p.depth_, 'length': p.length_}

# formatting, drop duplicates and sort by reg_fitness
best_programs = pd.DataFrame(best_programs).T
best_programs = best_programs.drop_duplicates(subset = ['expression'])
best_programs = best_programs.sort_values(by='reg_fitness', ascending=False)

# results
results = {'fit': model_fit, 'params': model_params, 'model': model, 'data': data,
           'best_programs': best_programs, 'details': model.run_details_}

In time series analysis, the lag operator $L$ operates on an element of a time series to produce the previous element. For example, given a time series variable $X_t = \left\{ X_1, X_2, ... X_N \right\}$, then $L X_{t} = X_{t-1} \; \forall \; t > 1$.

And the moving average is simply the average of a column's values over a window of arbitrary size; in the case of this project, the window size (memory) of the time series is 7.
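Both operations can be sketched with pandas (the price values below are toy numbers loosely based on the table above, not the project's actual feature code):

```python
import pandas as pd

# toy close prices, for illustration only
prices = pd.DataFrame({'close': [26.63, 27.24, 27.88, 27.91,
                                 27.48, 27.12, 26.95, 27.05]})

# lag operator L: L X_t = X_{t-1}
prices['lag_close_1'] = prices['close'].shift(1)

# moving average with a window (memory) of 7 observations;
# the first 6 rows are NaN because the window is incomplete
prices['ma_close_7'] = prices['close'].rolling(window=7).mean()
```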

In [9]:
# For the autoregressive feature engineering process
p_memory = 7

# Data with autoregressive variables
data_ar = fn.f_autoregressive_features(p_data=data_ohlc, p_nmax=p_memory)

# Dependent variable (Y) separation
data_y = data_ar['co_d'].copy()

# Timestamp separation
data_timestamp = data_ar['timestamp'].copy()

# Independent variables (x1, x2, ..., xn)
data_ar = data_ar.drop(['timestamp', 'co', 'co_d'], axis=1, inplace=False)

# print dataframe
data_ar
Out[9]:
lag_vol_1 lag_ol_1 lag_ho_1 lag_hl_1 ma_vol_1 ma_ol_1 ma_ho_1 ma_hl_1 lag_vol_2 lag_ol_2 ... ma_ho_6 ma_hl_6 lag_vol_7 lag_ol_7 lag_ho_7 lag_hl_7 ma_vol_7 ma_ol_7 ma_ho_7 ma_hl_7
0 22189.0 151.2 2982.0 3133.2 38320.0 153.9 7512.0 7665.9 1072.0 301.8 ... 3515.733333 5406.266667 808.0 497.4 142.4 639.8 46102.714286 1691.214286 4011.100000 5702.314286
1 38320.0 153.9 7512.0 7665.9 77283.0 724.1 3759.6 4483.7 22189.0 151.2 ... 2686.466667 4512.816667 59760.0 495.3 6983.3 7478.6 48606.000000 1723.900000 3550.571429 5274.471429
2 77283.0 724.1 3759.6 4483.7 38676.0 5369.7 409.0 5778.7 38320.0 153.9 ... 2402.233333 4246.616667 57761.0 1109.2 8735.2 9844.4 45879.571429 2332.542857 2361.114286 4693.657143
3 38676.0 5369.7 409.0 5778.7 40688.0 5368.9 888.1 6257.0 77283.0 724.1 ... 2705.333333 4716.933333 100513.0 5261.5 2114.4 7375.9 37333.142857 2347.885714 2185.928571 4533.814286
4 40688.0 5368.9 888.1 6257.0 540.0 700.1 -78.0 622.1 38676.0 5369.7 ... 2578.783333 4656.766667 43104.0 4365.6 -930.5 3435.1 31252.571429 1824.242857 2307.714286 4131.957143
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1181 3655.0 44.0 884.6 928.6 51816.0 443.1 1790.9 2234.0 35328.0 1505.6 ... 913.166667 2077.750000 2126.0 405.2 225.8 631.0 40042.857143 1101.128571 1029.928571 2131.057143
1182 51816.0 443.1 1790.9 2234.0 42135.0 969.3 1650.7 2620.0 3655.0 44.0 ... 1096.933333 2114.433333 49491.0 720.4 1730.5 2450.9 38992.000000 1136.685714 1018.528571 2155.214286
1183 42135.0 969.3 1650.7 2620.0 85099.0 447.1 3091.6 3538.7 51816.0 443.1 ... 1310.916667 2320.966667 41672.0 1851.8 548.1 2399.9 45195.857143 936.014286 1381.885714 2317.900000
1184 85099.0 447.1 3091.6 3538.7 63725.0 678.6 2805.0 3483.6 42135.0 969.3 ... 1770.833333 2452.116667 48311.0 491.8 1807.7 2299.5 47397.857143 962.700000 1524.357143 2487.057143
1185 63725.0 678.6 2805.0 3483.6 52380.0 1550.3 1713.1 3263.4 85099.0 447.1 ... 1989.316667 2678.050000 50027.0 2651.2 45.5 2696.7 47734.000000 805.428571 1762.585714 2568.014286

1186 rows × 56 columns

6.5 Hadamard feature engineering

In mathematics, the Hadamard product (also known as the element-wise product) is a binary operation that takes two matrices of the same dimensions and produces another matrix of the same dimension as the operands, where each element $i$, $j$ is the product of elements $i$, $j$ of the original two matrices.

For two matrices $A$ and $B$ of the same dimension $m \times n$, the Hadamard product $(A \circ B)$ is a matrix of the same dimensions as the operands, with the elements given by:

\begin{equation} (A \circ B)_{ij} = (A \odot B)_{ij} = (A)_{ij}(B)_{ij} \end{equation}

For matrices of different dimensions, the Hadamard product is undefined.

So, since the previously generated features (the autoregressive features) are column vectors of the same length ($rows = days$), the Hadamard product of any two feature columns is defined, and it amounts to the element-wise multiplication of the two columns, row by row.
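A minimal sketch of this element-wise product, using toy values taken from the first rows of the autoregressive features table:

```python
import numpy as np

# two feature columns of equal length, so the Hadamard product is defined
lag_vol_1 = np.array([22189.0, 38320.0, 77283.0])
ma_hl_1 = np.array([7665.9, 4483.7, 5778.7])

# (A o B)_i = A_i * B_i : plain element-wise multiplication, row by row
h_lag_vol_1_ma_hl_1 = lag_vol_1 * ma_hl_1
```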

In [10]:
# Data with Hadamard product variables
data_had = fn.f_hadamard_features(p_data=data_ar, p_nmax=p_memory)

# print result
data_had
Out[10]:
lag_vol_1 lag_ol_1 lag_ho_1 lag_hl_1 ma_vol_1 ma_ol_1 ma_ho_1 ma_hl_1 lag_vol_2 lag_ol_2 ... h_lag_vol_7_ma_hl_7 h_lag_ol_7_ma_ol_7 h_lag_ol_7_ma_ho_7 h_lag_ol_7_ma_hl_7 h_lag_ho_7_ma_ol_7 h_lag_ho_7_ma_ho_7 h_lag_ho_7_ma_hl_7 h_lag_hl_7_ma_ol_7 h_lag_hl_7_ma_ho_7 h_lag_hl_7_ma_hl_7
0 22189.0 151.2 2982.0 3133.2 38320.0 153.9 7512.0 7665.9 1072.0 301.8 ... 4.607470e+06 8.412100e+05 1.995121e+06 2.836331e+06 2.408289e+05 5.711806e+05 8.120096e+05 1.082039e+06 2.566302e+06 3.648341e+06
1 38320.0 153.9 7512.0 7665.9 77283.0 724.1 3759.6 4483.7 22189.0 151.2 ... 3.152024e+08 8.538477e+05 1.758598e+06 2.612446e+06 1.203851e+07 2.479471e+07 3.683322e+07 1.289236e+07 2.655330e+07 3.944566e+07
2 77283.0 724.1 3759.6 4483.7 38676.0 5369.7 409.0 5778.7 38320.0 153.9 ... 2.711103e+08 2.587257e+06 2.618948e+06 5.206205e+06 2.037523e+07 2.062481e+07 4.100003e+07 2.296248e+07 2.324375e+07 4.620624e+07
3 38676.0 5369.7 409.0 5778.7 40688.0 5368.9 888.1 6257.0 77283.0 724.1 ... 4.557073e+08 1.235340e+07 1.150126e+07 2.385466e+07 4.964370e+06 4.621927e+06 9.586297e+06 1.731777e+07 1.612319e+07 3.344096e+07
4 40688.0 5368.9 888.1 6257.0 540.0 700.1 -78.0 622.1 38676.0 5369.7 ... 1.781039e+08 7.963915e+06 1.007456e+07 1.803847e+07 -1.697458e+06 -2.147328e+06 -3.844786e+06 6.266457e+06 7.927229e+06 1.419369e+07
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1181 3655.0 44.0 884.6 928.6 51816.0 443.1 1790.9 2234.0 35328.0 1505.6 ... 4.530627e+06 4.461773e+05 4.173271e+05 8.635044e+05 2.486348e+05 2.325579e+05 4.811927e+05 6.948121e+05 6.498849e+05 1.344697e+06
1182 51816.0 443.1 1790.9 2234.0 42135.0 969.3 1650.7 2620.0 3655.0 44.0 ... 1.066637e+08 8.188684e+05 7.337480e+05 1.552616e+06 1.967035e+06 1.762564e+06 3.729598e+06 2.785903e+06 2.496312e+06 5.282215e+06
1183 42135.0 969.3 1650.7 2620.0 85099.0 447.1 3091.6 3538.7 51816.0 443.1 ... 9.659153e+07 1.733311e+06 2.558976e+06 4.292287e+06 5.130294e+05 7.574116e+05 1.270441e+06 2.246341e+06 3.316388e+06 5.562728e+06
1184 85099.0 447.1 3091.6 3538.7 63725.0 678.6 2805.0 3483.6 42135.0 969.3 ... 1.201522e+08 4.734559e+05 7.496788e+05 1.223135e+06 1.740273e+06 2.755580e+06 4.495853e+06 2.213729e+06 3.505259e+06 5.718988e+06
1185 63725.0 678.6 2805.0 3483.6 52380.0 1550.3 1713.1 3263.4 85099.0 447.1 ... 1.284701e+08 2.135352e+06 4.672967e+06 6.808319e+06 3.664700e+04 8.019765e+04 1.168446e+05 2.171999e+06 4.753165e+06 6.925164e+06

1186 rows × 140 columns

6.6 Genetic Programming for Symbolic Feature Engineering

The following operations were performed with the previously generated features; some of the operations have an arity of 1, like the inverse.

The used symbolic operations were:

  • Subtraction (sub) : $A - B$
  • Addition (add) : $A + B$
  • Multiplication (mul) : $A \cdot B$
  • Inverse (inv) : $A^{-1}$
  • Absolute value (abs) : $|A|$
  • Logarithm (log) : $log(A)$
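The binary operations are plain element-wise NumPy operations; the unary ones such as the inverse and the logarithm are typically implemented in "protected" form by genetic-programming libraries like gplearn, so that evolved expressions never produce inf or NaN. A sketch (the 0.001 guard threshold is illustrative):

```python
import numpy as np

def p_inv(a):
    """Protected inverse A**-1: returns 0 when |A| is (near) zero."""
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.where(np.abs(a) > 0.001, 1.0 / a, 0.0)

def p_log(a):
    """Protected logarithm log(|A|): returns 0 when |A| is (near) zero."""
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.where(np.abs(a) > 0.001, np.log(np.abs(a)), 0.0)

a = np.array([2.0, -4.0, 0.0])
b = np.array([1.0, 3.0, 5.0])
sub, add, mul = a - b, a + b, a * b          # binary ops: plain element-wise
inv, absolute, log = p_inv(a), np.abs(a), p_log(a)   # unary (arity 1) ops
```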
In [11]:
# -- -------------------------------------------------------- Symbolic Features Generator (Run it once) --- #

# run variable generating function
fun_sym = fn.symbolic_features(p_x=data_had, p_y=data_y)
    |   Population Average    |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0  1920.79        0.0607073        3         0.719074              N/A     11.37m
   1     7.15         0.321747        3          0.75442              N/A      1.55m
In [100]:
print('The dimensions of the resulting DataFrame with the BEST programs: ' + str(fun_sym['best_programs'].shape), '\n')
The dimensions of the resulting DataFrame with the BEST programs: (26, 5) 

In [101]:
display(fun_sym['best_programs'])
raw_fitness reg_fitness expression depth length
sym_0 0.754420 0.751420 div(ma_ol_1, ma_hl_1) 1 3
sym_2 0.740961 0.722961 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 18
sym_1 0.740961 0.716961 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 24
sym_4 0.719074 0.716074 div(h_lag_ho_1_ma_ol_1, h_lag_ho_1_ma_hl_1) 1 3
sym_3 0.740961 0.715961 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 7 25
sym_8 0.718701 0.714701 abs(div(h_lag_ho_1_ma_ol_1, h_lag_ho_1_ma_hl_1)) 2 4
sym_14 0.718290 0.701290 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 17
sym_10 0.718321 0.700321 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 18
sym_12 0.718306 0.700306 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 18
sym_17 0.718230 0.698230 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 20
sym_18 0.718110 0.697110 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 21
sym_21 0.717920 0.696920 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 21
sym_11 0.718320 0.695320 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 23
sym_13 0.718306 0.695306 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 23
sym_26 0.716284 0.695284 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 21
sym_16 0.718243 0.695243 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 23
sym_15 0.718260 0.694260 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 24
sym_29 0.716206 0.694206 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 22
sym_19 0.718110 0.694110 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 7 24
sym_20 0.717948 0.693948 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 24
sym_22 0.717909 0.693909 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 24
sym_25 0.716300 0.693300 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 23
sym_27 0.716273 0.693273 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 6 23
sym_24 0.716404 0.688404 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 8 28
sym_23 0.716641 0.676641 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 9 40
sym_28 0.716258 0.671258 abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), a... 9 45
In [27]:
# info about symbolic variables
best_p = list(fun_sym['best_programs'].index)
data_sym = fun_sym['data'][best_p]
display(data_sym.head(5), data_sym.tail(5))
sym_0 sym_2 sym_1 sym_4 sym_3 sym_8 sym_14 sym_10 sym_12 sym_17 ... sym_15 sym_29 sym_19 sym_20 sym_22 sym_25 sym_27 sym_24 sym_23 sym_28
0 0.020076 0.979924 0.979924 0.020076 0.979924 0.020076 0.019816 0.020076 0.020076 0.020076 ... 0.020076 0.020076 0.020076 0.020076 0.020076 0.020076 0.020034 0.020076 0.020076 0.020076
1 0.161496 0.838504 0.838504 0.161496 0.838504 0.161496 0.158886 0.161496 0.161496 0.161496 ... 0.161496 0.161496 0.161496 0.161496 0.161496 0.161496 0.159398 0.161496 0.161496 0.161496
2 0.929223 0.070777 0.070777 0.929223 0.070777 0.929223 0.922050 0.929223 0.929223 0.929223 ... 0.929223 0.929223 0.929223 0.929223 0.929223 0.929223 0.928969 0.929223 0.929223 0.929223
3 0.858063 0.141937 0.141937 0.858063 0.141937 0.858063 0.846462 0.858063 0.858063 0.858063 ... 0.858063 0.858063 0.858063 0.858063 0.858063 0.858063 0.857667 0.858063 0.858063 0.858063
4 1.125382 0.125382 0.125382 1.125382 0.125382 1.125382 1.111605 1.125382 1.125382 1.125382 ... 1.125382 1.125382 1.125382 1.125382 1.125382 1.125382 1.124799 1.125382 1.125382 1.125382

5 rows × 26 columns

sym_0 sym_2 sym_1 sym_4 sym_3 sym_8 sym_14 sym_10 sym_12 sym_17 ... sym_15 sym_29 sym_19 sym_20 sym_22 sym_25 sym_27 sym_24 sym_23 sym_28
1181 0.198344 0.801656 0.801656 0.198344 0.801656 0.198344 0.194678 0.198344 0.198344 0.198344 ... 0.198344 0.198344 0.198344 0.198344 0.198344 0.198344 0.198189 0.198344 0.198344 0.198344
1182 0.369962 0.630038 0.630038 0.369962 0.630038 0.369962 0.360014 0.369962 0.369962 0.369962 ... 0.369962 0.369962 0.369962 0.369962 0.369962 0.369962 0.369703 0.369962 0.369962 0.369962
1183 0.126346 0.873654 0.873654 0.126346 0.873654 0.126346 0.119027 0.126346 0.126346 0.126346 ... 0.126346 0.126346 0.126346 0.126346 0.126346 0.126346 0.125524 0.126346 0.126346 0.126346
1184 0.194798 0.805202 0.805202 0.194798 0.805202 0.194798 0.189931 0.194798 0.194798 0.194798 ... 0.194798 0.194798 0.194798 0.194798 0.194798 0.194798 0.194675 0.194798 0.194798 0.194798
1185 0.475057 0.524943 0.524943 0.475057 0.524943 0.475057 0.466224 0.475057 0.475057 0.475057 ... 0.475057 0.475057 0.475057 0.475057 0.475057 0.475057 0.474894 0.475057 0.475057 0.475057

5 rows × 26 columns

In [103]:
# symbolic expressions (equations) for the generated variables
print('\nA very simple program: ' + str(fun_sym['best_programs'].index[0]), '\n\n', 'co_d =', fun_sym['best_programs']['expression'][0])
print('\nA not so simple program: ' + str(fun_sym['best_programs'].index[1]), '\n\n', 'co_d =', fun_sym['best_programs']['expression'][1])
print('\nA very deep program: ' + str(fun_sym['best_programs'].index[25]), '\n\n', 'co_d =', fun_sym['best_programs']['expression'][25], '\n')
A very simple program: sym_0 

 co_d = div(ma_ol_1, ma_hl_1)

A not so simple program: sym_2 

 co_d = abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), abs(h_lag_ho_1_ma_ol_1))), div(mul(log(ma_hl_6), add(ma_ol_1, lag_hl_7)), inv(ma_vol_6)))))

A very deep program: sym_28 

 co_d = abs(abs(sub(abs(mul(inv(h_lag_ho_1_ma_hl_1), abs(h_lag_ho_1_ma_ol_1))), div(mul(div(add(log(add(h_lag_ol_7_ma_ol_7, h_lag_hl_4_ma_ol_4)), inv(inv(h_lag_vol_6_ma_ho_6))), sub(mul(sub(ma_ho_5, lag_ol_6), mul(h_lag_ol_3_ma_hl_3, h_lag_vol_1_ma_ho_1)), sub(mul(h_lag_ol_3_ma_ho_3, h_lag_vol_3_ma_ol_3), sub(lag_ho_4, ma_hl_2)))), add(ma_ol_1, lag_hl_7)), add(mul(h_lag_ol_3_ma_ho_3, h_lag_vol_6_ma_ho_6), add(h_lag_vol_3_ma_ho_3, ma_hl_6)))))) 

In [13]:
# save the generated symbolic features
# dt.data_save_load(p_data_objects={'features': data_sym, 'equations': eq_sym},
#                  p_data_action='save', p_data_file='files/features/oc_symbolic_features_11_65.dat')
In [104]:
# -- ------------------------------------------------------------------------------------ Load variables -- #
# -- --------------------------------------------------------------------------------------------------- -- #
# Load previously generated variables (for reproducibility purposes)
data_sym = dt.data_save_load(p_data_objects=None, p_data_action='load',
                             p_data_file='files/features/oc_symbolic_features_11_65.dat')

6.7 Features Data Concatenation

In [108]:
# data to be used in the next stage
data = pd.concat([data_ar.copy(), data_had.copy(), data_sym['features'].copy()], axis=1)

# print concatenated data
data
Out[108]:
lag_vol_1 lag_ol_1 lag_ho_1 lag_hl_1 ma_vol_1 ma_ol_1 ma_ho_1 ma_hl_1 lag_vol_2 lag_ol_2 ... sym_1 sym_2 sym_3 sym_5 sym_6 sym_7 sym_8 sym_9 sym_10 sym_11
0 22189.0 151.2 2982.0 3133.2 38320.0 153.9 7512.0 7665.9 1072.0 301.8 ... 1.912716 1.974595 -7358.1 1.662447 1.389499 1.317360 1.872803 2.136681 3.039032 2.933000
1 38320.0 153.9 7512.0 7665.9 77283.0 724.1 3759.6 4483.7 22189.0 151.2 ... 0.738006 0.924535 -3035.5 0.954165 0.833094 0.712792 1.058872 1.399459 1.342292 1.006930
2 77283.0 724.1 3759.6 4483.7 38676.0 5369.7 409.0 5778.7 38320.0 153.9 ... 0.068439 0.077677 4960.7 0.092766 0.096312 0.087139 0.173223 0.170258 0.133278 0.111576
3 38676.0 5369.7 409.0 5778.7 40688.0 5368.9 888.1 6257.0 77283.0 724.1 ... 0.161283 0.146883 4480.8 0.162546 0.188279 0.195884 0.406280 0.328277 0.285550 0.282639
4 40688.0 5368.9 888.1 6257.0 540.0 700.1 -78.0 622.1 38676.0 5369.7 ... -0.018487 -0.018201 778.1 -0.015721 -0.016750 -0.018877 -0.033800 -0.030247 -0.031223 -0.062667
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1181 3655.0 44.0 884.6 928.6 51816.0 443.1 1790.9 2234.0 35328.0 1505.6 ... 1.059621 0.922300 -1347.8 0.889526 0.861942 0.840381 1.738858 1.961197 1.815997 2.293673
1182 51816.0 443.1 1790.9 2234.0 42135.0 969.3 1650.7 2620.0 3655.0 44.0 ... 0.856379 0.858577 -681.4 0.794591 0.780682 0.765910 1.620671 1.504832 1.728880 1.396413
1183 42135.0 969.3 1650.7 2620.0 85099.0 447.1 3091.6 3538.7 51816.0 443.1 ... 1.105103 1.326682 -2644.5 1.376602 1.332031 1.333794 2.237233 2.358350 1.976726 1.667125
1184 85099.0 447.1 3091.6 3538.7 63725.0 678.6 2805.0 3483.6 42135.0 969.3 ... 0.872717 0.944739 -2126.4 1.095284 1.143910 1.127839 1.840120 1.584000 1.371933 1.201516
1185 63725.0 678.6 2805.0 3483.6 52380.0 1550.3 1713.1 3263.4 85099.0 447.1 ... 0.499655 0.530959 -162.8 0.565764 0.639682 0.667091 0.971924 0.861150 0.775067 0.739968

1186 rows × 291 columns

In [109]:
# model data
model_data = dict()

# Whole data separation for train and test
xtrain, xtest, ytrain, ytest = train_test_split(data, data_y, test_size=.2, shuffle=False)

# Data vision inside the dictionary
model_data['train_x'] = xtrain
model_data['train_y'] = ytrain
model_data['test_x'] = xtest
model_data['test_y'] = ytest

print('The training dataset has a length of:', len(xtrain),
      'data points')

print('\nThe test dataset has a length of:', len(xtest),
      'data points')
The training dataset has a length of: 948 data points

The test dataset has a length of: 238 data points

7. Experiments


  1. OLS Regression with ElasticNet

Use OLS regression with Elastic Net regularization to produce a numerical value, of which only the sign is taken as the predicted output.

  2. Support Vector Machines for Classification

Use the L1 case of support vector machines, testing different kernels.

Experiment 1: Ordinary Least Squares Regression with Elastic Net Regularization


\begin{equation} \hat{\beta} = \underset{\beta}{\arg\min} \left( |y - \boldsymbol{X} \beta|^2 + \lambda_{2} |\beta|^2 + \lambda_{1} |\beta|_{1} \right) \end{equation}

where:

\begin{equation} |\beta|^2 = \sum_{j=1}^{p}\beta_{j}^{2} = \text{Ridge (L2)} \quad , \quad |\beta|_1 = \sum_{j=1}^{p}|\beta_{j}| = \text{Lasso (L1)} \end{equation}

and the function $(1 - \alpha)|\beta|_{1} + \alpha|\beta|^2$ is called the elastic net penalty, which is a convex combination of the lasso and the ridge penalties.

If there is a group of several highly correlated variables, LASSO tends to select just one variable from the group and drop the others. This can at times be very useful to overcome correlation among explanatory variables, but it can also be counterproductive, since completely dropping variables can hurt the predictive power of the feature set.

And if a group of variables has large coefficients, perhaps due to overfitting, RIDGE tends to shrink all of them, producing smaller regression coefficients.

$\alpha$ is the mixing ratio of the elastic net penalty applied to the Ordinary Least Squares model.
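A minimal sketch of this setup with scikit-learn's `ElasticNet`, on synthetic data (the `alpha` and `l1_ratio` values here are illustrative, not the tuned parameters used below); the regression output is collapsed to its sign for classification:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# synthetic stand-in for the feature matrix and the {-1, 1} target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))

# alpha scales the whole penalty; l1_ratio is the L1/L2 mixing ratio
model = ElasticNet(alpha=0.1, l1_ratio=0.08)
model.fit(X, y)

# regression output, of which only the sign is kept as the predicted class
y_pred = np.sign(model.predict(X))
accuracy = (y_pred == y).mean()
```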

In [110]:
en_parameters = {'alpha': 11.9, 'ratio': .08}
elastic_net = fn.ols_elastic_net(p_data=model_data, p_params=en_parameters)
In [111]:
# Model accuracy (in of sample)
in_en_acc = round(elastic_net['metrics']['train']['acc']*100, 2)
print('The model accuracy with train data was: ', in_en_acc, '%')

# Model accuracy (out of sample)
out_en_acc = round(elastic_net['metrics']['test']['acc']*100, 2)
print('\nThe model accuracy with test data was: ', out_en_acc,'%')
The model accuracy with train data was:  44.83 %

The model accuracy with test data was:  39.5 %
In [112]:
# get train and test y data 
train_y = elastic_net['results']['data']['train']
test_y = elastic_net['results']['data']['test']

# build dict for the plot
ohlc_class = {'train_y': train_y['y_train'],
              'train_y_pred': train_y['y_train_pred'],
              'test_y': test_y['y_test'],
              'test_y_pred': test_y['y_test_pred']}

# make plot
plot_en = vs.g_ohlc_class(p_ohlc=data_ohlc,
                          p_theme=dt.theme_plot_3,
                          p_data_class=ohlc_class,
                          p_vlines=dates_folds)

# visualize plot
plot_en.show()

We can notice very poor results, both in the training period (left of the vertical line) and in the testing period (right of the vertical line).


Experiment 2: L1 Support Vector Machines


A supervised learning method used for classification, regression and outliers detection.

The advantages of support vector machines are:

  • Effective in high dimensional spaces.
  • Still effective in cases where number of dimensions is greater than the number of samples.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

The disadvantages of support vector machines include:

  • If the number of features is much greater than the number of samples, avoiding over-fitting when choosing the kernel function and the regularization term is a crucial aspect to address.
  • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

A support vector machine constructs a hyper-plane, or set of hyper-planes, in a high dimensional space (a Hilbert space). A good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since, in general, the larger the margin the lower the generalization error of the classifier.

The figure below shows the decision function for a linearly separable problem, with three samples on the margin boundaries, called support vectors.

But when the problem isn't linearly separable, as in most real applications, SVMs address non-linearly separable cases by introducing two concepts: the Soft Margin and the Kernel Trick.

  • Soft Margin: find a separating line, but tolerate one or a few misclassified dots (e.g. the dots circled in red)
  • Kernel Trick: find a non-linear decision boundary

The support vectors are the samples within the margin boundaries, and two types of errors are tolerated:

  • The dot is on the wrong side of the margin but on the correct side of the decision boundary (shown on the left)
  • The dot is on the wrong side of the decision boundary and on the wrong side of the margin (shown on the right)

For the kernel tricks, we will address three types of them:

  • Linear (linear) : $K(x_{k}, x_{l}) = \left\langle x_{k}, x_{l} \right\rangle$

  • Polynomial (poly) : $K(x_{k},x_{l}) = \left( x_{k}^Tx_{l} + c \right)^d$

  • Radial Based Function (rbf) : $K(x_{k}, x_{l}) = \exp \left( -\frac{|| x_{k} - x_{l} ||^2}{2 \sigma^2} \right)$

We will use the non-separable case and explore the three types of kernels to build a classifier with SVM.
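As a brief sketch of these three kernels on a toy non-linearly separable dataset, using scikit-learn's `SVC` and `make_moons` rather than the notebook's `fn.l1_svm` and `model_data` (hyperparameter values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# toy non-linearly separable data, a stand-in for the price features
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# the same three kernel types explored below
for kernel in ('linear', 'poly', 'rbf'):
    clf = SVC(kernel=kernel, C=1.5, degree=2, gamma='scale').fit(X, y)
    print(kernel, round(clf.score(X, y), 3))
```

On data like this, the non-linear kernels typically separate the classes noticeably better than the linear one.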

7.2.1 Mathematical formulation (Primal)

The L1 Support Vector Machines formulation for a classification problem is the following:

\begin{equation} \underset{w, b, \xi}{min} \quad P(w, \xi) = \frac{1}{2} w^T w + c \sum_{k=1}^{N} \xi_k \quad s.t. \quad y_{k} \left[ w^T \varphi(x_k) + b \right] \geq 1 - \xi_k, \quad \xi_k \geq 0, \quad k = 1, 2, ... , N. \end{equation}

where: \ $y_k \in \left\{-1, 1\right\}$ : the target variables. \ $\xi_k \geq 0$ : slack variables. \ $\varphi(x_k)$ : feature space mapping of the form $\varphi(\cdot) : \mathbb{R}^n \rightarrow \mathbb{R}^m$. \ $w \in \mathbb{R}^m$ : model weights (learned). \ $b \in \mathbb{R}$ : bias term (learned). \ $c > 0$ : regularization coefficient (hyperparameter). \ $\gamma > 0$ : kernel coefficient (hyperparameter). \

7.2.2 Mathematical formulation (Dual)

\begin{equation} L(w,b,\xi, \alpha, \lambda) = \frac{1}{2} w^T w - \sum_{k=1}^{N} \alpha_k [y_k (w^T \varphi(x_k) + b) -1] - \sum_{k=1}^{N} \alpha_k \xi_k + c \sum_{k=1}^{N} \xi_k - \sum_{k=1}^{N} \lambda_k \xi_k \end{equation}

And the decision function for a given sample $x$ becomes:

\begin{equation} \sum_{i \in SV}^{N} y_{i} \alpha_{i} K(x_{i}, x) + b \end{equation}

and the predicted class corresponds to its sign. We only need to sum over the support vectors (i.e. the samples that lie on or within the margin boundaries), because the dual coefficients are zero for the other samples.

where: \ $y_k \in \left\{-1, 1\right\} \rightarrow \left\{\textit{price goes down}, \textit{price goes up} \right\}$ : target classes. \ $\xi_k$ : used internally in $\textit{l1\_svm}$. \ $w \in \mathbb{R}^m$ : used internally in $\textit{l1\_svm}$. \ $b \in \mathbb{R}$ : used internally in $\textit{l1\_svm}$. \ $c > 0$ : inverse regularization coefficient. \ $K(x_k, x_{l})$ : kernel $\in \left\{linear, poly, rbf\right\}$. \ $\gamma > 0$ : kernel coefficient for $\textit{rbf}$ and $\textit{poly}$, $\in [0, 1]$

$K(x_k, x_{l})$, $\gamma$ and $c$ are the model hyperparameters we have to choose.
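As a hedged check of the dual decision function above, using scikit-learn's `SVC` attributes on toy data (not the notebook's model): `dual_coef_` stores the products $y_i \alpha_i$ for the support vectors, `support_vectors_` stores the $x_i$, and `intercept_` is $b$.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel='rbf', gamma=0.1).fit(X, y)

# manual sum over support vectors: sum_i y_i alpha_i K(x_i, x) + b
x = X[:5]
K = np.exp(-0.1 * ((clf.support_vectors_[:, None, :] - x[None, :, :]) ** 2).sum(-1))
manual = clf.dual_coef_ @ K + clf.intercept_
print(np.allclose(manual, clf.decision_function(x)))  # → True
```

This confirms numerically that only the support vectors contribute to the prediction.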

In this work, three types of kernels were tested: linear, Radial Basis Function (RBF) and polynomial.

LINEAR Kernel results

Linear: $K(x_{k}, x_{l}) = \left\langle x_{k}, x_{l} \right\rangle$

In [113]:
l1_svm_linear_params = {'kernel': 'linear', 'gamma': 'auto', 'c': 1.5, 'degree': 0, 'coef0': 0}
# gamma='scale' -> 1/(n_features * X.var())
# gamma='auto'  -> 1/n_features

l1_svm_linear = fn.l1_svm(p_data=model_data, p_params=l1_svm_linear_params)
In [114]:
# get train and test y data 
train_y = l1_svm_linear['results']['data']['train']
test_y = l1_svm_linear['results']['data']['test']

# build dict for the plot
ohlc_class = {'train_y': train_y['y_train'], 'train_y_pred': train_y['y_train_pred'],
              'test_y': test_y['y_test'], 'test_y_pred': test_y['y_test_pred']}

# plot title
dt.theme_plot_4['p_labels']['title'] = 'L1-SVM (LINEAR) Model Results'

# make plot
plot_svm_linear = vs.g_ohlc_class(p_ohlc=data_ohlc, p_theme=dt.theme_plot_4,
                           p_data_class=ohlc_class, p_vlines=dates_folds)

# visualize plot
plot_svm_linear.show()
In [115]:
# Model accuracy (in sample)
in_svm_acc = round(l1_svm_linear['metrics']['train']['acc']*100, 2)
print('The model accuracy with train data was: ', in_svm_acc, '%')

# Model accuracy (out of sample)
out_svm_acc = round(l1_svm_linear['metrics']['test']['acc']*100, 2)
print('\nThe model accuracy with test data was: ', out_svm_acc,'%')
The model accuracy with train data was:  59.92 %

The model accuracy with test data was:  57.98 %
In [116]:
# get the support vectors information
support_vectors = l1_svm_linear['model'].n_support_
print('The number of support vectors for class -1 is:', support_vectors[0])
print('\nThe number of support vectors for class 1 is:', support_vectors[1])
The number of support vectors for class -1 is: 226

The number of support vectors for class 1 is: 267

RBF Kernel results

Radial Basis Function (RBF) : \

$K(x_{k}, x_{l}) = \exp \left( -\frac{|| x_{k} - x_{l} ||^2}{2 \sigma^2} \right) = \exp(-\gamma || x_{k}-x_{l} ||^2)$ \

where: \ $\gamma = \frac{1}{2 \sigma^2} > 0: \text{gamma}$
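A quick numeric check of the substitution $\gamma = \frac{1}{2\sigma^2}$, using scikit-learn's `rbf_kernel` (the value of $\sigma$ and the sample points are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
xk, xl = rng.normal(size=(1, 3)), rng.normal(size=(1, 3))

sigma = 0.8
gamma = 1.0 / (2.0 * sigma ** 2)  # gamma written in terms of sigma

# the sigma form and the gamma form give the same kernel value
manual = np.exp(-np.sum((xk - xl) ** 2) / (2.0 * sigma ** 2))
print(np.isclose(manual, rbf_kernel(xk, xl, gamma=gamma)[0, 0]))  # → True
```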

In [117]:
l1_svm_rbf_params = {'kernel': 'rbf', 'gamma': 'scale', 'c': 1.5, 'degree': 0, 'coef0': 0}
# gamma='scale' -> 1/(n_features * X.var())
# gamma='auto'  -> 1/n_features

l1_svm_rbf = fn.l1_svm(p_data=model_data, p_params=l1_svm_rbf_params)

When training an SVM with the Radial Basis Function (RBF) kernel, two parameters must be considered:

  • The parameter C, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly.

  • The parameter gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.
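A small sketch of these trade-offs on toy data, with illustrative C and gamma values (note that a shuffled `train_test_split` like this would not be appropriate for the timeseries folds used in this notebook):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# low C / low gamma -> smoother surface; high C / high gamma -> fits train harder
for C, gamma in [(0.1, 0.1), (1.5, 1.0), (100.0, 50.0)]:
    clf = SVC(kernel='rbf', C=C, gamma=gamma).fit(X_tr, y_tr)
    print(C, gamma,
          round(clf.score(X_tr, y_tr), 2),  # train accuracy
          round(clf.score(X_te, y_te), 2))  # test accuracy
```

The extreme setting typically pushes train accuracy near 100% while test accuracy stops improving, which is the over-fitting behavior described above.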

In [118]:
# get train and test y data 
train_y = l1_svm_rbf['results']['data']['train']
test_y = l1_svm_rbf['results']['data']['test']

# build dict for the plot
ohlc_class = {'train_y': train_y['y_train'], 'train_y_pred': train_y['y_train_pred'],
              'test_y': test_y['y_test'], 'test_y_pred': test_y['y_test_pred']}

# plot title
dt.theme_plot_4['p_labels']['title'] = 'L1-SVM (RBF) Model Results'

# make plot
plot_svm = vs.g_ohlc_class(p_ohlc=data_ohlc, p_theme=dt.theme_plot_4,
                           p_data_class=ohlc_class, p_vlines=dates_folds)

# visualize plot
plot_svm.show()

Notice the good results in the training period (left of the vertical line) and the decent results in the testing period (right of the vertical line); note especially the prediction errors during the uptrend of the prices.

In [119]:
# Model accuracy (in sample)
in_svm_acc = round(l1_svm_rbf['metrics']['train']['acc']*100, 2)
print('The model accuracy with train data was: ', in_svm_acc, '%')

# Model accuracy (out of sample)
out_svm_acc = round(l1_svm_rbf['metrics']['test']['acc']*100, 2)
print('\nThe model accuracy with test data was: ', out_svm_acc,'%')
The model accuracy with train data was:  85.65 %

The model accuracy with test data was:  78.57 %
In [120]:
# get the support vectors information
support_vectors = l1_svm_rbf['model'].n_support_
print('The number of support vectors for class -1 is:', support_vectors[0])
print('\nThe number of support vectors for class 1 is:', support_vectors[1])
The number of support vectors for class -1 is: 271

The number of support vectors for class 1 is: 263

Polynomial Kernel results

Polynomial Function : \

$K(x_{k},x_{l}) = (x_{k}^Tx_{l} + c)^d = (\gamma \left\langle x_{k}, x_{l} \right\rangle + r )^d$ \

where: \ $\gamma > 0: \text{gamma}$ \ $d : \text{degree}$ \ $r : \text{coef0}$

In [121]:
l1_svm_poly_params = {'kernel': 'poly', 'gamma': 'scale', 'c': 1.5, 'degree': 2, 'coef0': 0}
# gamma='scale' -> 1/(n_features * X.var())
# gamma='auto'  -> 1/n_features

l1_svm_poly = fn.l1_svm(p_data=model_data, p_params=l1_svm_poly_params)

One problem with the polynomial kernel is that it may suffer from numerical instability: when $x^T y + c < 1$, $K(x, y) = (x^T y + c)^d$ tends to zero with increasing $d$, whereas when $x^T y + c > 1$, $K(x, y)$ tends to infinity. The most common degree is $d = 2$ (quadratic), since larger degrees tend to overfit.
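A quick numeric illustration of that instability, with hypothetical base values of $x^T y + c$ just below and just above 1:

```python
# base values standing in for (x^T y + c) below and above 1
below, above = 0.9, 1.1
for d in (2, 10, 50):
    print(d, below ** d, above ** d)
# as d grows, 0.9^d vanishes (~0.005 at d=50) while 1.1^d explodes (~117),
# so kernel values for different samples quickly span many orders of magnitude
```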

In [122]:
# get train and test y data 
train_y = l1_svm_poly['results']['data']['train']
test_y = l1_svm_poly['results']['data']['test']

# build dict for the plot
ohlc_class = {'train_y': train_y['y_train'], 'train_y_pred': train_y['y_train_pred'],
              'test_y': test_y['y_test'], 'test_y_pred': test_y['y_test_pred']}

# plot title
dt.theme_plot_4['p_labels']['title'] = 'L1-SVM (POLYNOMIAL) Model Results'

# make plot
plot_svm = vs.g_ohlc_class(p_ohlc=data_ohlc, p_theme=dt.theme_plot_4,
                           p_data_class=ohlc_class, p_vlines=dates_folds)

# visualize plot
plot_svm.show()

Notice the good results in the training period (left of the vertical line) and, in contrast with the RBF kernel, the good and stable results in the testing period (right of the vertical line), especially during the uptrend of the prices.

In [123]:
# Model accuracy (in sample)
in_svm_acc = round(l1_svm_poly['metrics']['train']['acc']*100, 2)
print('The model accuracy with train data was: ', in_svm_acc, '%')

# Model accuracy (out of sample)
out_svm_acc = round(l1_svm_poly['metrics']['test']['acc']*100, 2)
print('\nThe model accuracy with test data was: ', out_svm_acc,'%')
The model accuracy with train data was:  82.7 %

The model accuracy with test data was:  83.19 %
In [124]:
# get the support vectors information
support_vectors = l1_svm_poly['model'].n_support_
print('The number of support vectors for class -1 is:', support_vectors[0])
print('\nThe number of support vectors for class 1 is:', support_vectors[1])
The number of support vectors for class -1 is: 284

The number of support vectors for class 1 is: 287

8. Discussion and conclusions


8.1 Feature Engineering

Generating the symbolic features, in addition to the autoregressive and Hadamard features, proved very useful. We went from 4 time series (OHLC) to 300 explanatory variables, all of which were scaled with a standardization process. This was crucial, since neither OLS nor SVM is a scale-invariant algorithm, so scaling the data is highly recommended.
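As a hedged sketch of that standardization step, using scikit-learn's `StandardScaler` on synthetic features (not the notebook's feature matrix); fitting the scaler on the training fold only avoids leaking test-period statistics into training:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=3.0, size=(300, 4))  # stand-in explanatory variables

# fit on the training fold only, then apply the same transform to the test fold
X_train, X_test = X[:240], X[240:]
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# the training fold now has zero mean and unit variance per feature
print(X_train_s.mean(axis=0).round(6), X_train_s.std(axis=0).round(6))
```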

8.2 Accuracy

We obtained the best performance with the polynomial kernel, closely followed by the Radial Basis Function kernel. Therefore, even though the explanatory variables are both linear and non-linear representations of the phenomenon (price movement), a non-linear kernel was necessary; we can conclude this since the linear kernel performed very poorly.

8.3 Financial Timeseries Classification

Perhaps the most important conclusion is the following: we have validated that there is a series of considerations, steps and a general approach for the timeseries prediction problem. By predicting the sign of the price difference rather than the level of the exchange rate, we restated a regression problem as a classification problem, and doing so proved useful. Also, the feature engineering process, the data scaling and the proposed models and hyperparameters were good choices, since they produced decent out-of-sample accuracy rates.

9. Bibliography


Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd edition. Springer.

Vapnik, V. (1992). The Nature of Statistical Learning.

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.